{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Named Entity Extraction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The named entity extraction task aims to extract phrases from plain text that correspond to entities.\n",
    "Polyglot recognizes 3 categories of entities:\n",
    "\n",
    "- Locations (Tag: `I-LOC`): cities, countries, regions, continents, neighborhoods, administrative divisions, ...\n",
    "- Organizations (Tag: `I-ORG`): sports teams, newspapers, banks, universities, schools, non-profits, companies, ...\n",
    "- Persons (Tag: `I-PER`): politicians, scientists, artists, athletes, ..."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Language Coverage"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The models were trained on datasets extracted automatically from Wikipedia.\n",
    "Polyglot currently supports 40 major languages."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "  1. Polish                     2. Turkish                    3. Russian                  \n",
      "  4. Indonesian                 5. Czech                      6. Arabic                   \n",
      "  7. Korean                     8. Catalan; Valencian         9. Italian                  \n",
      " 10. Thai                      11. Romanian, Moldavian, ...  12. Tagalog                  \n",
      " 13. Danish                    14. Finnish                   15. German                   \n",
      " 16. Persian                   17. Dutch                     18. Chinese                  \n",
      " 19. French                    20. Portuguese                21. Slovak                   \n",
      " 22. Hebrew (modern)           23. Malay                     24. Slovene                  \n",
      " 25. Bulgarian                 26. Hindi                     27. Japanese                 \n",
      " 28. Hungarian                 29. Croatian                  30. Ukrainian                \n",
      " 31. Serbian                   32. Lithuanian                33. Norwegian                \n",
      " 34. Latvian                   35. Swedish                   36. English                  \n",
      " 37. Greek, Modern             38. Spanish; Castilian        39. Vietnamese               \n",
      " 40. Estonian                 \n"
     ]
    }
   ],
   "source": [
    "from polyglot.downloader import downloader\n",
    "print(downloader.supported_languages_table(\"ner2\", 3))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Download Necessary Models"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[polyglot_data] Downloading package embeddings2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package embeddings2.en is already up-to-date!\n",
      "[polyglot_data] Downloading package ner2.en to\n",
      "[polyglot_data]     /home/rmyeid/polyglot_data...\n",
      "[polyglot_data]   Package ner2.en is already up-to-date!\n"
     ]
    }
   ],
   "source": [
    "%%bash\n",
    "polyglot download embeddings2.en ner2.en"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Example"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Entities inside a text object or a sentence are represented as chunks.\n",
    "Each chunk identifies the start and the end indices of the word subsequence within the text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": false
   },
   "outputs": [],
   "source": [
    "from polyglot.text import Text"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "blob = \"\"\"The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a \"threat to the entire world\".\"\"\"\n",
    "text = Text(blob)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can query all entities mentioned in a text."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "[I-ORG([u'Israeli']), I-PER([u'Benjamin', u'Netanyahu']), I-LOC([u'Iran'])]"
      ]
     },
     "execution_count": 6,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "text.entities"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Or we can query the entities of each sentence."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a \"threat to the entire world\". \n",
      "\n",
      "I-ORG [u'Israeli']\n",
      "I-PER [u'Benjamin', u'Netanyahu']\n",
      "I-LOC [u'Iran']\n"
     ]
    }
   ],
   "source": [
    "for sent in text.sentences:\n",
    "    print(sent, \"\\n\")\n",
    "    for entity in sent.entities:\n",
    "        print(entity.tag, entity)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Inspecting the second entity, `Benjamin Netanyahu`, more closely, we can locate its position within the sentence through its `start` and `end` word indices."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {
    "collapsed": false,
    "scrolled": true
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "WordList([u'Benjamin', u'Netanyahu'])"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
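    "# `sent` is the last sentence visited by the loop above; take its second entity\n",
    "benjamin = sent.entities[1]\n",
    "# start and end are word indices into the sentence (end is exclusive)\n",
    "sent.words[benjamin.start:benjamin.end]"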
   ]
  },
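  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Because each entity carries a `tag`, the chunks can also be grouped by category. The following is a minimal sketch rather than part of the library: it relies only on `Text`, `entities`, and `tag` as used above, assumes the English models have already been downloaded, and treats each chunk as a list of words. The name `by_tag` is introduced here purely for illustration.\n",
    "\n",
    "```python\n",
    "from collections import defaultdict\n",
    "\n",
    "from polyglot.text import Text\n",
    "\n",
    "blob = \"The Israeli Prime Minister Benjamin Netanyahu has warned that Iran poses a threat.\"\n",
    "text = Text(blob)\n",
    "\n",
    "# by_tag is an illustrative name: it maps a tag to the entity strings found\n",
    "by_tag = defaultdict(list)\n",
    "for entity in text.entities:\n",
    "    # join the words of the chunk into a single display string\n",
    "    by_tag[entity.tag].append(\" \".join(entity))\n",
    "\n",
    "for tag in sorted(by_tag):\n",
    "    print(\"{}: {}\".format(tag, \", \".join(by_tag[tag])))\n",
    "```\n",
    "\n",
    "Grouping by tag makes it straightforward to list only one category, for example the locations mentioned in a document."
   ]
  },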
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Command Line Interface"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "metadata": {
    "collapsed": false
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      ",               O    \r\n",
      "which           O    \r\n",
      "was             O    \r\n",
      "equalled        O    \r\n",
      "five            O    \r\n",
      "days            O    \r\n",
      "ago             O    \r\n",
      "by              O    \r\n",
      "South           I-LOC\r\n",
      "Africa          I-LOC\r\n",
      "in              O    \r\n",
      "their           O    \r\n",
      "victory         O    \r\n",
      "over            O    \r\n",
      "West            I-ORG\r\n",
      "Indies          I-ORG\r\n",
      "in              O    \r\n",
      "Sydney          I-LOC\r\n",
      ".               O    \r\n",
      "\r\n"
     ]
    }
   ],
   "source": [
    "!polyglot --lang en tokenize --input testdata/cricket.txt | polyglot --lang en ner | tail -n 20"
   ]
  },
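  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The command above prints one tag per token, whereas the Python API returns chunks. As a rough sketch of how such per-token tags can be merged into chunks with start and end indices, consider the code below. It is not Polyglot's internal implementation; `group_tags` and the sample `tokens` list are hypothetical, with the tokens copied from the tail of the output above.\n",
    "\n",
    "```python\n",
    "def group_tags(tagged_tokens):\n",
    "    \"\"\"Merge runs of identical non-O tags into (tag, start, end) chunks.\"\"\"\n",
    "    chunks = []\n",
    "    current, start = \"O\", 0\n",
    "    # the trailing (\"\", \"O\") sentinel closes a chunk that ends the sequence\n",
    "    for i, (_, tag) in enumerate(list(tagged_tokens) + [(\"\", \"O\")]):\n",
    "        if tag != current:\n",
    "            if current != \"O\":\n",
    "                chunks.append((current, start, i))  # end index is exclusive\n",
    "            current, start = tag, i\n",
    "    return chunks\n",
    "\n",
    "\n",
    "tokens = [(\"by\", \"O\"), (\"South\", \"I-LOC\"), (\"Africa\", \"I-LOC\"), (\"in\", \"O\"),\n",
    "          (\"their\", \"O\"), (\"victory\", \"O\"), (\"over\", \"O\"), (\"West\", \"I-ORG\"),\n",
    "          (\"Indies\", \"I-ORG\"), (\"in\", \"O\"), (\"Sydney\", \"I-LOC\"), (\".\", \"O\")]\n",
    "\n",
    "for tag, start, end in group_tags(tokens):\n",
    "    words = \" \".join(word for word, _ in tokens[start:end])\n",
    "    print(\"{} {}\".format(tag, words))\n",
    "```\n",
    "\n",
    "For the sample tokens this yields three chunks: `South Africa` (`I-LOC`), `West Indies` (`I-ORG`), and `Sydney` (`I-LOC`). Because the tag set has no separate begin marker, this sketch would merge two adjacent entities of the same category into a single chunk."
   ]
  },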
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Demo"
   ]
  },
  {
   "cell_type": "raw",
   "metadata": {},
   "source": [
    ".. raw:: html\n",
    "   <embed>\n",
    "   <iframe src=\"https://entityextractor.appspot.com/\" width=\"100%\" height=\"225\" seamless></iframe>\n",
    "   </embed>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Citation\n",
    "\n",
    "This work is a direct implementation of the research described in the [Polyglot-NER: Massive Multilingual Named Entity Recognition](https://sites.google.com/site/rmyeid/papers/polyglot-ner.pdf?attredirects=0&d=1) paper.\n",
    "The author of this library strongly encourages you to cite the following paper if you use this software."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "```\n",
    "@article{polyglotner,\n",
    "        author = {Al-Rfou, Rami and Kulkarni, Vivek and Perozzi, Bryan and Skiena, Steven},\n",
    "        title = {{Polyglot-NER}: Massive Multilingual Named Entity Recognition},\n",
    "        journal = {{Proceedings of the 2015 {SIAM} International Conference on Data Mining, Vancouver, British Columbia, Canada, April 30 - May 2, 2015}},\n",
    "        month     = {April},\n",
    "        year      = {2015},\n",
    "        publisher = {SIAM}\n",
    "}\n",
    "```"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## References\n",
    "\n",
    "- [Polyglot-NER project page](https://bit.ly/polyglot-ner).\n",
    "- [Wikipedia on NER](http://en.wikipedia.org/wiki/Named-entity_recognition)."
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 2",
   "language": "python",
   "name": "python2"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 2
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython2",
   "version": "2.7.6"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 0
}